Preface


The present paper can best be described as an informal attempt to
alert and/or remind research staff members about certain statistical
rationales and procedures possibly helpful to them in the design and
analysis of research. Since the primary effort involved in this paper is
heuristic rather than pedagogical, theoretical notes have been footnoted
or referenced rather than dealt with in the body of the paper, and an
intuitive rather than a derivative approach has guided selection of the
expository material.

This paper, like Gaul, is divided into three parts. The first part
deals with the necessity of computing the number of subjects required
for a given experiment. Rationales and procedures are presented therein.
The second part explains the advantages of having equal numbers of
subjects in experimental treatment "conditions" or "cells," and shows to
what extent bias may enter into the analysis of unequal cell frequencies.
The third part outlines the sometimes drastic effect of making multiple
comparisons on the same data and suggests some alternate procedures.

It should be pointed out that all three parts deal with decisions
that are solely under the control of the researcher and that can be made
in advance of data collection. Making decisions of this sort in advance
can greatly facilitate analysis and interpretation of results.

Comments on the utility of the procedures discussed herein would be
appreciated.

Part I - How Many Subjects Do We Need?

Researchers, being human, sometimes forget to have their car serviced,
or miss their annual checkup at the physician's or dentist's office. If
they were asked to supply the reason(s) for forgetting this "preventive
maintenance" on their car, body, or teeth, we would undoubtedly receive
quite a range of replies. Such might include some thin rationalizations
(easily transparent to the amateur) or might consist of intricately
reasoned, balanced structures of thesis, antithesis, and synthesis. All
would have in common two contradictory thoughts, however: 1) It is
important and, of course, I know about the need for it; and 2) You don't
have to make such a fuss over it; it isn't that necessary.

Against this background let us see what "preventive maintenance" as
applied to the design of research has to offer. Most researchers will
carefully assume the operation of Kelley's Law 1/ in the planning of their
research. Given this tendency, one finds that large numbers of subjects
(Ss) are often run in order for the researcher or experimenter (E) to
have the highest confidence in the generality of his findings. But there
is Research₁ and Research₂. The former may be considered as the case in
which additional data collection is relatively inexpensive in terms of
the E's time and effort and of hidden costs (such as preparation of
additional experimental materials, data analysis, overhead, etc.). The
latter type of research, Research₂, has the reverse characteristics:
extra effort in data collection (such as running "additional" Ss) may be
quite costly.


1/ "If anything can go wrong... it will."

In certain cases running many Ss may introduce subtle biases in the
data; at the extreme is the case when prolonged observation may result
in a change in the phenomenon being studied.

Let us hypothesize a case of Research₂. Suppose we have inaugurated
a study to determine the effects of "Training Method A" on
student proficiency in a course in a particular subject. We choose a
USCONARC school as the site of our study, and we take one-half of the
input to that school for a given period of time, eventually comparing
the effects of the administration of "Training Method A" with the
proficiency of the group receiving the conventional training. Quite
apart from the design of such a study, which has been dealt with
elsewhere (MacCaslin & Cogan, 1964), what are some of the implications of
administering Training Method A to such a large group of Ss? These
may be listed as follows:

1. The "experimental" group, by virtue of knowledge of receiving
a special or unusual training method, may reach inflated levels
of proficiency quite apart from the effects of the method itself. This
is the well-known "Hawthorne Effect" (cited in Lindzey, 1954,
pp. 1105-1106).

2. The possibility of a reverse effect (performance deficit
under "guinea pig" conditions) may not be automatically ruled out,
however. This is especially true where data collection covers a
prolonged period of time. In such cases the people providing the local
support may lose whatever original enthusiasm they once had, and subtly
transmit this loss of enthusiasm to the S-population. 2/

3. The cost of running one-half of the student input may be
unjustified. Perhaps one-fifth as large a group would have sufficed in
order to achieve "statistically significant" results. 3/ In the
discussion of quality and quantity of support needed that inevitably must
occur between HumRRO and USCONARC, the question of "How many subjects
do you need?" is a thorny one. It may be just a bit too easy on
ourselves to reply, "As many as we can get."


2/ Current events may also be fatal to the prolongation of data
collection. An informal report (Ayres, 1964) indicated that
correlations of pre- and post-test of Supervisory techniques
which were averaging in the upper .60's dropped to [illegible] for a
group which took its post-test on the afternoon of 22 November
1963, and had listened on the radio during lunch to the events
transpiring in Dallas, Texas.

3/ Since the sociology of psychological research is what it is,
and since the E is probably going to attempt to publish his
findings, he probably will want to indicate the probability of
a Type I error, the probability of rejecting the Null Hypothesis
when it is in fact true. A second model, based on establishing
confidence limits on a difference score (i.e., the difference
between a and b is x units or more at the .05 level) also exists
but will not be discussed herein.


We are in a much stronger position, operationally speaking, when
we can specify the number of subjects we need for a given research
project. There are several questions one would wish to take into
account when thinking about such specification:

1. How large a difference between Group A and Group B would we
consider a meaningful difference? This question assumes that the E
will statistically test his data at customary levels of significance;
the "difference" referred to here is determined on psychological and
economic grounds, not on statistical ones.

2.  What  sort  of  variance  in  the  measurement  of  performance 
(proficiency)  can  we  realistically  expect?  Here  we  begin  to  consider 
the  variability  of  the  two  methods  to  be  compared.  Will  each  method 
yield  equally  variable  data,  or  is  one  subject  to  greater  fluctuation? 
How  variable  will  the  yield  be? 

3. What sort of risk are we willing to take, statistically
speaking, in claiming that there is a difference between the groups when
there is in fact no difference (and the apparent difference is due to
chance or random fluctuation)? This is the risk of a Type I error,
the error of rejecting the Null Hypothesis when it is in fact true,
as tested by the customary "Level of Significance." It is sometimes
also called an "alpha error."

4. What sort of a risk are we willing to take, in analogous
manner to Type I error, in claiming that there is no difference
between the groups when there is in fact a difference? This is the risk
of a Type II error, the error of failing to reject the Null Hypothesis
when it is in fact false. This error is sometimes called a "beta
error." By taking the complement of the Type II error (1.0 minus the
probability of a Type II error), we obtain the "power" of the
statistical test, or the probability that we will not commit a Type II
error.

Now what has all of the above to do with research done at HumRRO?
In general we have agreed that it would indeed be helpful to know how
many subjects we need, if not for the research proper, then for the
economics of the research and for the liaison value of this knowledge.
One of the purposes of this note is to demonstrate the ease of
computing the number of subjects needed, given answers to the above four
questions.

In general, an Exploratory Study (ES) precedes the Task
Conceptualization Paper (TCP). As set forth above, there are four criteria
for determining the number of Ss needed for a given research project.
The most difficult question is that posed by Question No. 2, which
asks, "What is the expected variance of the observations?" On the
basis of the ES, the researcher should have some approximation of the
variance; answers to the other questions are set by the E himself
following psychological and/or economic rationales. Calculation of the
number of Ss required is then a simple and straightforward procedure.

As an illustration, the following example is provided for a
two-sided test of significance (where the E cannot or will not predict in
advance which group will have a higher "score").

Computing Formula: 4/

$$N = \frac{2s^2}{d^2}\,\left(z_{\alpha/2} + z_{\beta}\right)^2$$

Where N = number of subjects needed (in each group)

1) $2s^2$ = Two times the expected or estimated variance of the
observations (proficiency scores, number of items correct on a test
of transfer, etc.)

2) $d^2$ = The smallest difference (squared) between the two groups
that the researcher wishes to be able to detect.

3) $z_{\alpha/2}$ = The z-value for one-half the usual Level of
Significance; if we are working at the p = .05 level, then
$z_{\alpha/2}$, as obtained from tables of percentile values of the
normal curve, equals 1.96.

4) $z_{\beta}$ = The z-value for the risk of Type II error; if we wish
this risk to be .10, then $z_{\beta}$ = 1.282. This value is obtained
analogously from the tables of percentile values of the normal curve.

Note: Both "z-values" ($z_{\alpha/2}$ and $z_{\beta}$) are always positive.

Example: We have decided, on non-statistical grounds, to look for a
difference of 20 units; this is the smallest difference which we
would consider as worth finding. We have also decided to use the
customary .05 Level of Significance, and wish to limit our chances
of making a Type II error to a probability of 10% (or .10). We


4/ From Walker & Lev (1953), p. 166. There is an analogous procedure
for one-sided tests. The logic of the computation is more fully
discussed therein.
have also determined, via an ES, that the standard deviation will
be about 20 units and thus the variance is (20)^2 = 400 units
(the same units as for the difference above). For a two-sided
test which we will eventually make, the values of the above
symbols are:

1) $s^2$ = 400, so that $2s^2$ = 800

2) $d^2$ = 20^2 = 400

3) $z_{\alpha/2}$ = 1.96

4) $z_{\beta}$ = 1.282, therefore:

$$N = \frac{800}{400}\,(1.96 + 1.282)^2 = 2(3.242)^2 = 21.02$$

or

N = 22 subjects per group

(Note: This suggests one should choose about 25 Ss
per group. The main point here is to determine an
order of magnitude, that is, what ballpark we play in,
rather than absolute numbers.)

Now suppose we had wished to work with different values; let us
estimate our variance as 250. Our "smallest significant difference" will
now be 10 units. Our Level of Significance will remain the same, but we
have decided that we can "afford" to risk a Type II error with a
probability of 25% (or .25) this time. What N do we need? We now
substitute:

1) $s^2$ = 250, so that $2s^2$ = 500

2) $d^2$ = 10 x 10 = 100

3) $z_{\alpha/2}$ = 1.96

4) $z_{\beta}$ = .6745

$$N = \frac{500}{100}\,(1.96 + .6745)^2 = 5(2.6345)^2 = 5(6.94) = 34.70$$

or

N = 35-40 subjects per group
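
The two worked examples above can be checked with a few lines of code. What follows is a minimal sketch, assuming only the normal-approximation formula given under "Computing Formula"; the function name `subjects_per_group` and its parameter names are illustrative and do not come from Walker & Lev. It takes its z-values from the Python standard library rather than from printed tables, so they carry more decimal places than 1.96 and 1.282.

```python
import math
from statistics import NormalDist


def subjects_per_group(variance, min_difference, alpha=0.05, beta=0.10):
    """Return the unrounded N per group for a two-sided test.

    Implements N = (2 s^2 / d^2) * (z_{alpha/2} + z_beta)^2, where
    `variance` is s^2 and `min_difference` is d (hypothetical names).
    """
    z_alpha_half = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 at alpha = .05
    z_beta = NormalDist().inv_cdf(1.0 - beta)               # about 1.282 at beta = .10
    return (2.0 * variance / min_difference ** 2) * (z_alpha_half + z_beta) ** 2


# First example: variance 400, difference 20, alpha .05, beta .10.
first = subjects_per_group(variance=400, min_difference=20, alpha=0.05, beta=0.10)
# Second example: variance 250, difference 10, alpha .05, beta .25.
second = subjects_per_group(variance=250, min_difference=10, alpha=0.05, beta=0.25)

# Round up, since a fraction of a subject cannot be run.
print(round(first, 2), math.ceil(first))
print(round(second, 2), math.ceil(second))
```

The unrounded results agree with the hand computations to within the rounding of the table values, and rounding up gives 22 and 35 subjects per group respectively.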

As a perhaps interesting sidelight, the procedure generalizes to
comparisons of more than two groups (the n-group case). The obtained
N refers to the number of subjects we need in each group regardless of
whether we have two groups and intend to make a t-test, or have n groups
and intend to do an Analysis of Variance. 5/

At this juncture, the ease of computation is probably more apparent
than the need. Some case histories, a free blend of reality, disguise,
and fantasy, are presented below in order to "document" the need.


"Case  History  #1" 

Jim Jones, researcher sans peur, gets into a conversation with
Lt. Col. Brown on, of all things, the number of subjects Jones needs at
his school. Jones says, hopefully, "As many as we can get, uh,
preferably 750 to 1000." Brown: "How long do you plan to use them, and how
many at a time?" Jones: "100 per week for 10 weeks, if possible."
Brown: "I'm afraid that's out since we can't handle an additional 100
students here without strain. We have to set up support companies, find
cooks and bakers, first sergeants, commanding officers, etc. No, (shaking
his head) I'm sorry." Jones: "How about 500?" (The rest of the
conversation is fairly predictable.)


"Case  History  #1  Alternate" 

Locale and setting as above, with slight modification in past
history of events leading up to conversation. Jones: "We've computed the


5/ Personal communication, Dr. Eugene Cogan, November 1964.


number of subjects we need, and the minimum number we must have is 440,
or 20 subjects in each of 22 groups." Brown: "How long do you need
them for?" Jones: "44 per week for 10 weeks, if possible." Brown:
"Maybe if we push it, we can arrange to overman a few companies by 20
students for 10 weeks. Are you sure you can't do it with less subjects?"
(Jones explains his calculations and goes over the figures with Brown.
Brown is reassured and now can use this information in his discussion
with Col. Green.)

"Case  History  #2" 

Programed instruction materials are to be prepared to collect data
on programing vs. conventional methods of teaching cost accounting. How
many booklets should be printed? Jones calculates the number of Ss he
needs, adds a percentage for attrition, misprinted booklets, etc., and
presents his figures to his D/R, who carefully notes the thoroughness of
Jones. (This is of course reflected in Jones' next year's salary
recommendation.)

"Case  History  #3" 

It has been determined that the cost of running a S in Task FIGMENT
is $42. Jones wishes to run Ss (using sequential analysis) only until he
attains "statistically significant" results; this consideration is based
primarily on economic reasons, since adding Ss needlessly rapidly reaches
the point of diminishing returns. After computation, he determines that
the total outlay for data collection is too large and devises alternate
ways to collect data, paring costs of collection to $18.75 per S.


Jones,  having  been  burnt  in  the  past,  decides  to  run  10  Ss  per 
group  in  a  rather  arbitrary  way.  Remembering  Kelley's  Law,  he  computes 
the  N  he  needs,  finding  to  his  horror  it  is  22  Ss  per  group.  If  he  ran 
10  Ss  in  each  group,  he  would  have  virtually  no  chance  of  achieving 
acceptable  levels  of  significance. 

Concluding Comments:

For the amount of potential gain, measured against the amount of
time taken in the process, one of the best researcher bets is the
computation of the number of Ss he needs in his task. Like preventive
maintenance, this concept is often widely acknowledged and unwisely
avoided. Or forgotten. Like Kelley's Law.

Part II - Why Try for Equality of Cell Frequencies?

It is each researcher's inalienable right to distribute his
subjects (Ss) across treatment conditions as he sees fit. But, just as
in deviating from the rules of bidding in bridge, he must have some
rationale for deviating from the "rules" of assignment of Ss. These
"rules," though not particularly stringent, have been established much
in the same way as most rules: they have been found to work at an
empirical level. The cardinal principle is, of course, to have an equal
number of Ss per cell (or treatment). Let us see what happens when
this principle is violated.

Let us suppose that we have two groups, one of 15 Ss and the other
of 5 Ss. We examine the mean difference between these two groups by
means of a t-test, noting that there is heterogeneity of variance.
Although the variance of the large group is only one-fifth the
variance of the small group, we find a "significant" difference at
p = .05. Surely there is nothing wrong with this! Yet, as examination
of Table 1 will show, there is. Our "real" level of significance, as
theoretically determined, is p = .18 and we have erroneously rejected
the Null Hypothesis. Were the train of events to halt there, little
would be lost; but we usually take some further action on the basis of
our study, making some change in a training program. The ramifications
spread. (Note that if we had used 7 Ss in each group, that is, fewer
subjects, we would have been able to reject the Null Hypothesis at the
.063 level.)

Now suppose we had two large 6/ but unequal groups, one twice the
size of the other, and with the variance of the smaller group five times
that of the larger group. By examination of Table 2 we can see that our
nominal 5% level is in reality a 12% level. And so it goes.

By examination of the two Tables, the reader thus far has
discovered for himself three points: (1) A level of significance can be
biased on the "safe" side, as when there are two unequal groups and the
large variance comes from the large group; (2) A level of significance
can be biased on the "unsafe" side, when the larger variance comes from
the smaller of two unequal groups; and (3) There need be little or no
bias in the level of significance if the sample sizes are equal. Table
3 generalizes the above to the three- and five-sample case. 7/
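
The direction of these three biases can be reproduced for the large-sample case with a short calculation. What follows is a minimal sketch, assuming both groups are large, the scores are normally distributed, and the ordinary pooled-variance t-test is referred to the normal curve. Under those assumptions the test statistic has variance (theta + R) / (R * theta + 1), where R is the size of the first group divided by the size of the second and theta is the variance of the first group divided by the variance of the second. The function name and the symbols R and theta are illustrative and are not taken from the paper's tables.

```python
from statistics import NormalDist


def true_alpha_large_samples(size_ratio, variance_ratio, nominal_alpha=0.05):
    """Approximate real two-sided Type I error of a pooled-variance test.

    size_ratio     : n1 / n2 (R), both samples assumed large
    variance_ratio : variance of group 1 / variance of group 2 (theta)
    """
    normal = NormalDist()
    z_crit = normal.inv_cdf(1.0 - nominal_alpha / 2.0)
    # Variance of the pooled test statistic when the null hypothesis is true.
    stat_variance = (variance_ratio + size_ratio) / (size_ratio * variance_ratio + 1.0)
    return 2.0 * (1.0 - normal.cdf(z_crit / stat_variance ** 0.5))


# Group 1 is twice as large; its variance is one-fifth that of group 2
# (the larger variance comes from the smaller group): "unsafe" bias.
unsafe = true_alpha_large_samples(size_ratio=2.0, variance_ratio=0.2)
# Same sizes, but the larger variance comes from the larger group: "safe" bias.
safe = true_alpha_large_samples(size_ratio=2.0, variance_ratio=5.0)
# Equal sizes: no bias in the large-sample case, whatever the variances.
equal = true_alpha_large_samples(size_ratio=1.0, variance_ratio=0.2)

print(round(unsafe, 3), round(safe, 3), round(equal, 3))
```

The first case comes out near the 12% level cited in the text for a nominal 5% test, the second falls well below 5%, and the third stays at 5%.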


6/ Most statistical theorists, when they speak of "large n" refer to
sample sizes of 20 to 30. Scheffé, however, speaks of "large n" in
asymptotic terms; i.e., as n approaches infinity.

7/ Lindquist (1953) speaks of the effects of violating assumptions
underlying the F test, citing the Norton study, by now a classic
(pp. 78-86). Even when the shapes of the curves and the variances
in the n-sample case are very discrepant, normal-theory statistics
still provide a remarkably good fit, provided the sample sizes are
equal. See also Boneau (1960) in a readable article dealing with
these matters applied to t-tests.

Part III - Milking the Data, or
Effects of Making Multiple Tests on the Same Study

In studies of psychological phenomena, we try to arrange things
such that we get the best return for our research dollar. This often
means that we run an Analysis of Variance (ANOVA) design so that we can
assess the effect(s) of the main treatment(s) as well as any
interactions that may occur. With ANOVA designs we usually run into the
problem of making multiple comparisons: we wish to know which cells of
subtreatment combinations are more effective in terms of our independent
variable(s). The tendency may arise to "milk the data," i.e., make as
many tests of significance as seems suggested by whatever psychological
rationales we may have. Even when these decisions are taken a priori,
there is an upper limit as to the number of tests we may make. This
limit is given by k-1, where k is the number of means (of cells or
treatments) obtained in the study. The rationale is best given by
Walker and Lev (1953) as follows:

If as many comparisons are formed as there are degrees of
freedom (in this case, k-1), then the sums of squares of a
set of orthogonal comparisons constitute a complete
subdivision of the total sum of squares. It should be noted
that orthogonal sets of comparisons can be made up in an
endless number of ways (p. 357, italics and parenthetical
comment added).

In other words, the limit on the number of independent comparisons that
can be made is given by k-1; even though we may choose to make various
comparisons (perhaps combining certain cell means with those of others
prior to comparisons), we are limited in the total number of such
comparisons. 8/
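
The "complete subdivision" described by Walker and Lev can be verified numerically. What follows is a minimal sketch, assuming equal numbers of Ss per cell and using one convenient orthogonal set (Helmert-type contrasts, each comparing the average of the first j means with mean j+1). The cell means, the cell size, and all function names are hypothetical and serve only as an illustration.

```python
def helmert_contrasts(k):
    """Return one possible set of k-1 mutually orthogonal contrasts on k means."""
    contrasts = []
    for j in range(1, k):
        # Average of the first j means versus mean number j+1.
        contrasts.append([1.0] * j + [-float(j)] + [0.0] * (k - j - 1))
    return contrasts


def contrast_ss(coefs, means, n_per_cell):
    """Sum of squares (1 df) for a single contrast, equal cell sizes assumed."""
    value = sum(c * m for c, m in zip(coefs, means))
    return n_per_cell * value ** 2 / sum(c * c for c in coefs)


def between_ss(means, n_per_cell):
    """Between-groups sum of squares (k-1 df), equal cell sizes assumed."""
    grand_mean = sum(means) / len(means)
    return n_per_cell * sum((m - grand_mean) ** 2 for m in means)


means = [12.0, 15.0, 11.0, 18.0]  # hypothetical cell means, k = 4
n_per_cell = 10                   # hypothetical equal cell size

contrasts = helmert_contrasts(len(means))
total_from_contrasts = sum(contrast_ss(c, means, n_per_cell) for c in contrasts)

# The k-1 = 3 contrast sums of squares add up to the between-groups SS.
print(total_from_contrasts, between_ss(means, n_per_cell))
```

Any other orthogonal set of three contrasts would add up to the same between-groups sum of squares; a fourth comparison could not be independent of the first three.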

Does the foregoing imply that we can promiscuously make t-tests
until we reach the magic number of (k-1)? Not at all. For every t-test
we make we increase the chances of making a Type I error, that is, of
erroneously rejecting the Null Hypothesis. This is intuitively clear
when we realize that for every 100 such t-tests we make we can expect to
find significant differences in about five of them, if we are working at
the 5% level, on the basis of "chance" alone. The probability that one
or more of these tests will be "significant" at the 5% level approaches
unity rapidly as the number of tests increases. With h tests of
significance, the probability of making no Type I error at all is
reduced to (1-alpha)^h, where

8/ A comment by Dr. Eugene A. Cogan on the above is reproduced in
its entirety below. It represents an alternate way of looking at
the problem, and the author is grateful to Dr. Cogan for
suggesting it.

"There is, however, another case that is best treated by
subdivision of sums of squares. That is, if five groups are run and
there is specific interest in comparing Groups One and Two and in
comparing Groups Three and Five, and a rather diffuse interest in
'everything else,' it is possible to subdivide the four degrees
of freedom into specific tests for the specified effects and also
2 df for 'the rest.' This does not require that significant
overall F ratio be evident; in fact, the method suggests we do not
even bother to compute the overall F."


h equals the number of comparisons we make and alpha is the level of
significance chosen. 9/ In order to demonstrate the effects of multiple
testing on the same data, Table 1 is offered. This table shows new
levels of significance, given a nominal level of 5% and 1%, and
represents a straightforward computation of 1 - (1-alpha)^h for h
comparisons. It will be noted that this table has the satisfying
property of showing that, as h becomes very large, the probability of
making a Type I error approaches unity. It also shows that this approach
is much less rapid when working at the 1% level.
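
The computation behind such a table is easy to reproduce. What follows is a minimal sketch, assuming the h comparisons are independent (the assumption underlying the 1 - (1-alpha)^h expression); the function name and the particular values of h printed are illustrative choices and are not copied from the paper's table.

```python
def prob_at_least_one_type_i(alpha, h):
    """Probability of one or more Type I errors in h independent tests."""
    return 1.0 - (1.0 - alpha) ** h


# One row per number of comparisons: h, level at nominal 5%, level at nominal 1%.
rows = []
for h in (1, 2, 5, 10, 20, 50, 100):
    rows.append((h,
                 prob_at_least_one_type_i(0.05, h),
                 prob_at_least_one_type_i(0.01, h)))

for h, at_05, at_01 in rows:
    print(f"h = {h:3d}   nominal .05 -> {at_05:.3f}   nominal .01 -> {at_01:.3f}")
```

Both columns climb toward 1.0 as h grows, with the 1% column climbing far more slowly, which is the pattern described in the text.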

The problem of multiple comparisons is still not resolved in the
psychological literature (Ryan, 1959, 1962; Wilson, 1962). Ryan (1959)
lists five different cases in which multiple comparisons are made. 10/

9/ The intuitive rationale underlying the factor of (1-alpha)^h has to
do with the combination of independent probabilities. "If X1 and X2
are independent observations, the joint probability that X1 will be
in C1 and X2 in C2 is the product of their separate probabilities
(Walker & Lev, 1953, p. 15)." Since the level of significance, or
probability, is unchanged over several comparisons, we multiply the
factor by itself for as many comparisons as we make.

10/ The implication seems clear. It is better to avoid fishing
expeditions in which trivial comparisons are made along with important ones.
By so doing, one (1) has more confidence that Type I error has not
been inflated; (2) avoids interpretation of the trivial effects,
whether "significant" or not; and (3) focuses attention on the
comparisons which are central to the purposes of the study.

Although his paper is restricted to the case in which several different
groups are to be compared (as in a simple-randomized design), he has
several points which generalize to other kinds of analyses.

He points out that the difference between a priori and a posteriori
comparisons is slight. Early workers in the field suggest that when an
overall F test is significant one can make t-tests between treatment
means; this method would be incorrect if the experimenter had not
specified which tests he would make in advance of data inspection. Ryan
suggests that this line of reasoning closely parallels that of making
one- or two-sided tests:

...The one-tailed test is appropriate only if the direction
of difference is predicted in advance, and if the
experimenter  is  willing  to  overlook  any  difference  in  the 
opposite  direction,  no  matter  how  large.  Only  two  conclusions 
are  possible  from  the  data  when  a  one-tailed  test  is  used — 
either  there  is  a  difference  in  the  predicted  direction,  or  the 
results of the experiment are inconclusive. In effect, the experiment
cannot obtain results which are considered a significant
refutation  of  the  prediction.  If  the  experimenter  allows  for  the 
possibility  of  a  result  that  contradicts  his  hypothesis,  he  must 
use  a  two-tailed  test,  and  there  is  no  difference  in  method  of 
analysis from that... where no predictions are made in advance.


In the case of more than two means, the number of possible
conclusions is increased. We may have not only confirmation or
contradiction of the prediction, but we may also have varying
degrees of partial agreement with the prediction... (and) the
situation is reduced to essentially the a posteriori case.

Only if the experimenter states in advance all possible
conclusions and the rules by which these conclusions will be
drawn, would he have an a priori test (p. 28; italics added).

The implication is clear: If the experimenter is going to make
multiple comparisons after finding a significant F in his Analysis of
Variance, he must have specified in advance which t-tests he will make,
what conclusions he will draw if only some, all, or none are significant,
taking into account all possible ways the results of his tests could come
out and the interpretations he would make in each case. (With six means
there are five orthogonal comparisons possible, each of which has three
possible outcomes: significantly in favor, significantly against, and
inconclusive with respect to the hypothesis. This means that the
experimenter has to make fifteen interpretations in advance!)

In  addition,  most  workers  in  the  field  assume  that  there  is  only  one 
type  of  a  Type  I  error;  Ryan  points  out  that  there  are  three,  calling 
these  "error  rates." 


"1.  Error  rate  per  comparison.  This  is  the  probability 
that  any  one  of  the  comparisons  will  be  incorrectly  considered 
to  be  significant.... 

"2.  Error  rate  per  experiment.  This  is  the  long-run 
average  number  of  erroneous  statements  per  experiment.  In 
statistical  jargon  it  is  the  expected  number  of  errors  per 
experiment.  Unlike  the  first  error  rate,  which  is  a  probability, 
the  error  rate  per  experiment  could  be  greater  than  one.  That 
is,  we  could  set  a  criterion  of  "significance"  in  such  a  way  that 
we  would  average  three  false  statements  per  experiment. 


"3. Error rate experimentwise. This is the probability
that one or more erroneous conclusions will be drawn in a given
experiment. In other words, experiments are divided into two
classes: (a) those in which all conclusions are correct, and
(b) those in which some conclusions are incorrect. The error
rate experimentwise is the probability that a given experiment
belongs in Class (b)." (p. 29; all italics his)


Now we can see why it is incorrect to make k-1 t-tests (on k means)
following a significant F at the .05 level. The significant F tells us
that our experimentwise error rate, (3) above, is .05, but says nothing
about the individual comparisons made. For that error rate we must
separately consider the error rate per comparison, (1) above, combining
the probabilities of each. Thus, if we make 5 t-tests following a
significant F (at the .05 level of significance), the chances are
22.6 out of 100 (1 - .95^5 = .226, treating the tests as independent)
that we will erroneously reject the Null Hypothesis one or more times
(see Ryan, 1959, p. 31), using the experiment as the unit of our analysis.
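
The arithmetic behind the 22.6 figure can be checked in a few lines. The
sketch below is in Python and rests on two assumptions that are not spelled
out in the text: that the h comparisons are independent, and that every null
hypothesis tested is in fact true. The function names are illustrative only
and are not taken from Ryan.

```python
def error_rate_per_comparison(alpha):
    """Probability that any one comparison is wrongly called significant."""
    return alpha


def error_rate_per_experiment(alpha, h):
    """Expected number of false rejections in h comparisons (can exceed 1)."""
    return h * alpha


def error_rate_experimentwise(alpha, h):
    """Probability of one or more false rejections in h comparisons.

    Assumes the h comparisons are independent and all nulls are true.
    """
    return 1.0 - (1.0 - alpha) ** h


if __name__ == "__main__":
    alpha = 0.05
    for h in (1, 2, 5, 10, 25):
        print(
            h,
            round(error_rate_per_experiment(alpha, h), 3),
            round(error_rate_experimentwise(alpha, h), 3),
        )
```

With alpha = .05 and h = 5 the experimentwise figure comes to about .226,
the value quoted above. The per-experiment figure, .25, is an expected count
rather than a probability, which is why it can pass 1 as h grows.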


A  Procedural  Note 

The only safe general procedure for making multiple comparisons
would appear to be to use the studentized range test, which is a test
referring to a probability distribution of the range of k means, based
on an estimated standard error of the mean (Ryan, 1959, p. 43). The
studentized range test essentially compares the greatest range in the
obtained means with those of the theoretical probability distributions
for k means. 11/


11/ Thus keeping the experiment as the unit on which our error rate is
based. The studentized range test yields the probability that (at
our selected alpha level) one or more comparisons will be significant.
(See "Error rate experimentwise," above.)


Tables for the studentized range test may be found in Snedecor (1956,
p. 252) and Dixon and Massey (1957, p. 1^0). In addition, the latter
source has a discussion of theory and procedure on pp. 152-155.
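
As a rough illustration of the statistic that is carried to those tables,
the Python sketch below computes a studentized range q for k groups of equal
size: the range of the group means divided by the estimated standard error
of a mean. It assumes equal n per group and a pooled within-groups mean
square as the error term. The function name and the scores are hypothetical,
and no critical values are built in; the obtained q is still to be compared
with the tabled value for k means and the within-groups degrees of freedom.

```python
from math import sqrt
from statistics import mean


def studentized_range(groups):
    """Return (q, k, df_within) for k groups of equal size n."""
    k = len(groups)
    n = len(groups[0]) if groups else 0
    if k < 2 or n < 2 or any(len(g) != n for g in groups):
        raise ValueError("need at least two groups of equal size n >= 2")

    means = [mean(g) for g in groups]

    # Pooled within-groups sum of squares and mean square.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_within = k * (n - 1)
    ms_within = ss_within / df_within
    if ms_within == 0:
        raise ValueError("no within-group variation; q is undefined")

    # Estimated standard error of a single group mean.
    se_mean = sqrt(ms_within / n)

    q = (max(means) - min(means)) / se_mean
    return q, k, df_within


# Hypothetical scores for three treatment conditions, four subjects each.
scores = [[1, 2, 3, 4], [3, 4, 5, 6], [6, 7, 8, 9]]
q, k, df = studentized_range(scores)
print(round(q, 3), k, df)
```

The function returns k and the degrees of freedom along with q because both
are needed to enter the table.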

Conclusions  and  Summary 

We have briefly reviewed three of the many areas of decision which
confront the researcher prior to his assumption of the role of experimenter.
He must decide whether or not to compute the number of subjects he will
need to attain given levels of significance, how to distribute these
subjects across experimental treatments, and finally how many and what kind
of statistical analyses to make. Each of these areas will have a different
relative weight for different researchers, but all are sources of
potential bias and, therefore, must be taken into account.

It was recommended that the number of subjects needed be pre-computed
whenever possible. It was also recommended that equal numbers of subjects
be assigned to the several treatment conditions (if inequalities across
cells exist after data collection due to attrition, random elimination of
subjects to reduce to equal cell-frequencies may be possible). It was
further recommended that the studentized range test be used whenever
multiple comparisons must be made, and certain caveats were noted with
respect to multiple testing of experimental hypotheses.
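
The random elimination of subjects mentioned above might be carried out
along the following lines. This is only a sketch in Python: the function
name, the cell labels, and the scores are hypothetical, and it assumes that
attrition was unrelated to treatment, so that discarding subjects at random
does not itself introduce bias.

```python
import random


def equalize_cells(cells, seed=None):
    """Randomly drop subjects so every cell has the smallest cell's size.

    cells maps a treatment label to a list of that cell's scores.
    A seed may be given so that the elimination can be reproduced.
    """
    if not cells:
        raise ValueError("at least one cell is required")
    rng = random.Random(seed)
    n_min = min(len(scores) for scores in cells.values())
    # Sample without replacement; the smallest cell keeps all its subjects.
    return {label: rng.sample(scores, n_min) for label, scores in cells.items()}


# Hypothetical data: three cells left unequal by attrition.
data = {"A": [12, 15, 11, 14, 13], "B": [9, 10, 8], "C": [16, 18, 17, 15]}
balanced = equalize_cells(data, seed=1)
print({label: len(scores) for label, scores in balanced.items()})
```

Recording the seed lets the same reduced data set be regenerated later.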
Table 1

New Levels of Significance When Unequal Population Variances Exist Given
Small Samples of Unequal Size at a Nominal 5% Significance Level*

                      Ratio of Variance 1 to Variance 2
Size of
Samples       1:10     1:5     1:2     1:1     2:1     5:1    10:1

15, 5          23%     18%    9.8%      5%    2.5%    0.8%    0.5%
            (Biased on "unsafe" side)        (Biased on "safe" side)

5, 3           14%     10%    7.2%      5%    3.8%    3.1%    3.0%

7, 6            7%    6.3%    5.8%      5%    5.1%    5.8%    6.3%

*Adapted from Scheffé (1959), p. 353.


Table 2

New Levels of Significance When Unequal Population Variances Exist Given
"Large" Sample Sizes at a Nominal 5% Significance Level*

Ratio of Sample 1         Ratio of Variance 1 to Variance 2
to Sample 2            1:5      1:2      1:1      2:1      5:1

1:1                     5%       5%       5%       5%       5%

2:1                    12%       8%       5%     2.9%     1.4%
                  (Biased on "unsafe" side)   (Biased on "safe" side)

5:1                    22%      12%       5%     1.4%     0.2%

*Adapted from Scheffé (1959), p. 340.


Table 3

New Levels of Significance When Unequal Population Variances Exist in
Three- and Five-group Cases (n specified and small) at a Nominal 5%
Level of Significance, Using Scheffé's "One-Way Layout" or the
Lindquistian "Simple-Randomized" Design

No. of     Ratio of         Group            New Level of
Groups     Variances        Sizes            Significance

3          1:2:3            5,5,5             5.6%
                            3,9,3             5.6%
                            7,5,3             9.2%
                            3,5,7             4.0%

3          1:1:3            5,5,5             5.9%
                            7,5,3            11.0%
                            9,5,1            17.0%
                            1,5,9             1.3%

3          25:100:225       3,3,3             7.3% *
                            10,10,10          6.6% *+

5          1:1:1:1:3        5,5,5,5,5         7.4%
                            9,5,5,5,1        14.0%
                            1,5,5,5,9         2.5%

Adapted from Scheffé (1959), p. 354.

* Adapted from Lindquist (1953), p. 84.

+ Note how, as n increases, levels of significance tend to revert
to the nominal significance level, even with very widely discrepant
variances.


Table  4 

Values for Actual Level of Significance When Making Multiple Comparisons
at a Nominal Level of Significance of 5% and 1% with h = 1, 2, ..., 25


Number of
Comparisons (h)


For a Nominal 5%
Level of Significance


For a Nominal 1%
Level of Significance


